# Instructions
This archive utilizes conda environments to reproduce results. Thus, you will need a working Anaconda distribution. Once a distribution is installed, follow the steps below to reproduce results:
1. **Determine the right environment for your desired level of reproduction:** If you want to re-run the analysis that recreates all of the tables and plots in the paper and appendices, we've provided all of the data and a simple environment (debate_analysis.yml) to do this. If you want to refresh all of the data by re-classifying documents with the LLMs and re-training models, you will need to create the debate_cuda environment.
2. **Install the appropriate conda environment:** See the section below for instructions on installing the right environment.
3. **Reproduce results with run.py**: run.py accepts arguments that allow you to specify which sections of the archive you would like to reproduce and which computing platform you are using. It is designed to be run from the console, and will produce a file in the logs folder with the results of the run. See the **Using run.py** section for further details.

# Environments
This repo has three different conda environments for reproducing different parts of the manuscript.

### debate_analysis.yml
This is a light-weight CPU environment for reproducing the results in the analysis folder. It does not install any of the dependencies necessary for using an Apple or Nvidia GPU and thus cannot be used to refresh the data. To create the environment, run `conda env create -f debate_analysis.yml` in the directory.

### debate_cuda.yml
This is the environment for reproducing any GPU-bound tasks. It can also run anything covered by debate_analysis.yml. It is designed for a Linux environment and was originally run on Ubuntu 24.04. It MUST be created by running the create_cuda bash script. Simply creating an environment from the debate_cuda.yml file will not install the flash attention package necessary for the modern BERT models. To create the environment, run `./create_cuda.sh` in a console opened in the directory.

### debate_mps.yml
This is the environment for reproducing any of the Apple MPS benchmarks. It can also run anything covered by debate_analysis.yml. To create the environment, run `conda env create -f debate_mps.yml` in the directory.
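If you are unsure which environment matches your machine, a quick check along the following lines may help. This is only a sketch and not part of the archive: the function names are hypothetical, it assumes the GPU environments install PyTorch, and it assumes the flash attention package is importable as `flash_attn`.

```python
import importlib.util


def detect_hardware() -> str:
    """Return "cuda", "mps", or "cpu" based on what PyTorch reports."""
    # Without PyTorch installed, no accelerator backend can be used.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds may not ship the MPS backend module at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


def has_flash_attention() -> bool:
    """Check whether a module named flash_attn can be imported (assumed name)."""
    return importlib.util.find_spec("flash_attn") is not None


if __name__ == "__main__":
    print("hardware:", detect_hardware())
    print("flash attention installed:", has_flash_attention())
```

A result of `cpu` suggests debate_analysis.yml is the only environment that will be useful on that machine. If the debate_cuda environment reports that flash attention is missing, it was most likely created from the .yml file directly rather than with `./create_cuda.sh`.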

# Using run.py
run.py is meant to be run from the command line and will run all of the scripts in the specified folder, using the specified hardware, while creating a log of outputs. It accepts the following command line options:
- `--hardware`: Specifies the computing platform for scripts that use the language models. Can be one of "cuda", "cpu", or "mps". Use cuda for an Nvidia GPU, mps for Apple silicon, and cpu for a CPU. Note that scripts in the cpu_benchmarks, mps_benchmarks, and gpu_benchmarks folders are hard coded to run with the appropriate hardware and will not respect this argument. Defaults to cuda if none is specified.
- `--sections`: Specifies which folder to run. Can be one of analysis, labeling, gpu_benchmarks, cpu_benchmarks, mps_benchmarks, or llama.

We recommend using the `--continue-on-error` argument with run.py. Instead of stopping at the first error, the run will continue through the remaining scripts, which lets you identify every error in a single pass.

### Example usage
To run the analysis section:

```python run.py --sections analysis --continue-on-error```

To refresh the data using an Nvidia GPU:

```python run.py --sections labeling --hardware cuda --continue-on-error```

To re-run the timing benchmarks for Apple Silicon:

```python run.py --sections mps_benchmarks --hardware mps --continue-on-error```
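To run several sections back to back, the commands above can be wrapped in a small driver script. The following is a sketch rather than part of the archive: `build_command` and `run_sections` are hypothetical helpers, and they use only the options documented above, invoking run.py once per section.

```python
import subprocess
import sys

# Values taken from the option descriptions in this README.
VALID_SECTIONS = {"analysis", "labeling", "gpu_benchmarks",
                  "cpu_benchmarks", "mps_benchmarks", "llama"}
VALID_HARDWARE = {"cuda", "cpu", "mps"}


def build_command(section, hardware="cuda", continue_on_error=True):
    """Build the argument list for a single run.py invocation."""
    if section not in VALID_SECTIONS:
        raise ValueError(f"unknown section: {section!r}")
    if hardware not in VALID_HARDWARE:
        raise ValueError(f"unknown hardware: {hardware!r}")
    cmd = [sys.executable, "run.py", "--sections", section,
           "--hardware", hardware]
    if continue_on_error:
        cmd.append("--continue-on-error")
    return cmd


def run_sections(sections, hardware="cuda"):
    """Run each section in order and return a dict of exit codes."""
    exit_codes = {}
    for section in sections:
        result = subprocess.run(build_command(section, hardware))
        exit_codes[section] = result.returncode
    return exit_codes


if __name__ == "__main__":
    # Must be launched from the directory that contains run.py.
    print(run_sections(["labeling", "analysis"], hardware="cuda"))
```

Running labeling before analysis, as in the example, refreshes the data first and then regenerates the tables and figures from it.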

# Folders For Reproducing Results
### analysis
**Run Time**: ~ 3 hours

Contains scripts required to reproduce all numbers, figures, and tables in the main paper and appendix. These scripts can be run with a CPU only environment. The name of each script corresponds to the section of the manuscript it replicates results for.

### figures 
All figures generated by scripts in the analysis folder.

### tables 
All tables generated by scripts in the analysis folder.

### data 
All data files necessary for replication. Data is refreshed by the labeling, benchmark, and llama folders.

# Folders For Refreshing the Data
### labeling 
**Run Time**: ~ 8 hours on an RTX 3090

Contains scripts for re-classifying all documents used in the paper. This folder will not generate any new tables or figures, but will refresh the data and labels in the data folder. Scripts are named after their associated sections of the paper. Across all models and random seeds, this entails few-shot training and reclassifying datasets well over 100 times. A GPU should be considered mandatory to re-run the scripts in this folder. 12 GB of VRAM should be sufficient, but 24 GB is recommended.

### cpu_benchmarks
**Run Time**:

Re-runs classification speed benchmarks on a CPU. This re-creates the data used for the tables and plots created in the analysis folder. Do not expect similar results unless you are running the script on the same CPU used in the paper (Ryzen 9900X).

### mps_benchmarks 
**Run Time**: ~ 1.5 hours on an M3 Max

Timing benchmarks for Apple silicon. This re-creates the data used for the tables and plots created in the analysis folder. The benchmarks in the paper were run on an M3 Max.

### gpu_benchmarks
**Run Time**:

Timing benchmarks for the GPU. This re-creates the data used for the tables and plots created in the analysis folder. The benchmarks in the paper were run on an RTX 3090.

### llama
**Run Time**:

Scripts for zero-shot and few-shot classification with Llama 3.1 8B. A GPU with at least 24 GB of memory is required for these scripts. They are kept separate from the other labeling scripts because they are particularly slow to run and require a larger GPU. **Note: To run any script using Llama 3.1 8B you will need permission and a read key for the Hugging Face Hub.** If you do not have one, the authors may be able to provide a temporary one for replication purposes. Once you have a key, paste it into llama_key.txt and the scripts will read it in.
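To confirm that your key file is readable before starting a long run, something like the following can be used. This is a minimal sketch, not the code the archive's scripts use: `read_hf_key` and `login_to_hub` are hypothetical helpers, and the sketch assumes llama_key.txt holds only the key and that the `huggingface_hub` package is installed in the environment.

```python
from pathlib import Path


def read_hf_key(path="llama_key.txt"):
    """Read the Hugging Face key from a text file, stripping whitespace."""
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{path} is empty; paste your read key into it.")
    return key


def login_to_hub(path="llama_key.txt"):
    """Authenticate with the Hugging Face Hub using the stored key."""
    # Imported here so read_hf_key works without huggingface_hub installed.
    from huggingface_hub import login

    login(token=read_hf_key(path))


if __name__ == "__main__":
    login_to_hub()
    print("Logged in to the Hugging Face Hub.")
```

A successful login only shows that the key is valid. Access to the Llama 3.1 8B weights is granted separately on the model's Hugging Face page.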

# logs 
This folder contains output logs from replication runs.

# Data
Here is a brief explanation of each file contained in the data folder.
- alt_hypotheses.csv: Labels generated from the DEBATE models using different hypotheses for the same document and classification task. Used in Appendix E.
- covid_classified_tweets.csv: Tweets and their labels for the replication in section 6.
- covid_fewshot_base.csv: Experiment results for few-shot training DEBATE Base in section 6.
- covid_fewshot_large.csv: Experiment results for few-shot training DEBATE Large in section 6.
- covid_llama_peft_res.csv: Experiment results for few-shot training Llama 3.1 8B in section 6.
- covid_textless_tweets.csv: A lightweight version of covid_classified_tweets.csv that contains only the tweet ID and labels.
- covid_tweets_labeled.csv: The training set for the COVID classifier. Compiled by Block et al. and used for few-shot training.
- covid_twitter_users.csv: Twitter user data to recreate the regressions in section 6.
- cpu_timings.csv: Experiment results for cpu benchmarks.
- cuda_timings.csv: Experiment results for gpu benchmarks.
- deliberative_politics.csv: Deliberative politics dataset used in Appendix E.
- fewshot_overfit_res.csv: Results from few-shot training experiments in Appendix E.
- freedom_test.csv: Mood of the Nation data for Appendix F.
- hypothesis_variants.csv: Dataset containing all hypotheses in the PolNLI test set and their alternatives.
- llama_motn_25shot.csv: Mood of the Nation responses with labels from Llama 3.1 8B, both zero-shot and 25-shot.
- manual_validation.csv: Test set documents manually validated by the authors.
- motn_fewshot_base.csv: Results from the few-shot training experiment in Appendix F for the base model.
- motn_fewshot_large.csv: Results from the few-shot training experiment in Appendix F for the large model.
- mps_timings.csv: Experiment results for mps benchmarks.
- nli_bench.csv: Documents and labels for the NLI datasets used in Appendix E.
- out_domain_bench.csv: "Out of domain" datasets used in Appendix E.
- polnli_test_results.csv: The PolNLI test set with labels from all of the models tested in the paper.
- rand_terror.csv: RAND terror dataset used in Appendix E.
- results_matrix.csv: A matrix of performance metrics for all models across all datasets in the PolNLI test set. Used to construct plots.
- stability_labels.csv: Labels on the PolNLI test set taken from 10 different checkpoints saved during training from each of the models. Used for Figure 10 in Appendix E.
- t4_timings.csv: Experiment results for T4 (Google Colab) benchmarks.
- ukp_stance.csv: Stance dataset used in Appendix E.
- ukp_topic.csv: Topic dataset used in Appendix E.
- wandb_loss.csv: Loss data exported from Weights and Biases while training the models.
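Before running the analysis section, it can be useful to confirm that every file listed above is present. The sketch below is not part of the archive: `missing_data_files` is a hypothetical helper, the file names are copied from the list above, and it assumes the data folder is named `data` and sits in the current working directory.

```python
from pathlib import Path

# File names copied from the list in this README.
EXPECTED_FILES = [
    "alt_hypotheses.csv", "covid_classified_tweets.csv",
    "covid_fewshot_base.csv", "covid_fewshot_large.csv",
    "covid_llama_peft_res.csv", "covid_textless_tweets.csv",
    "covid_tweets_labeled.csv", "covid_twitter_users.csv",
    "cpu_timings.csv", "cuda_timings.csv", "deliberative_politics.csv",
    "fewshot_overfit_res.csv", "freedom_test.csv",
    "hypothesis_variants.csv", "llama_motn_25shot.csv",
    "manual_validation.csv", "motn_fewshot_base.csv",
    "motn_fewshot_large.csv", "mps_timings.csv", "nli_bench.csv",
    "out_domain_bench.csv", "polnli_test_results.csv", "rand_terror.csv",
    "results_matrix.csv", "stability_labels.csv", "t4_timings.csv",
    "ukp_stance.csv", "ukp_topic.csv", "wandb_loss.csv",
]


def missing_data_files(data_dir="data"):
    """Return the expected file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected data files are present.")
```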

# api_queries
This folder contains notebooks and scripts that must be run in a cloud environment or require an API key for a proprietary model. We provide these notebooks in the interest of transparency, but consider these portions of the paper to be non-reproducible because we cannot control the computing environments of cloud platforms or the versioning of proprietary LLMs.
- benchmark_time_T4.ipynb: Should be run on Google Colab. Used to get timing benchmarks for a T4 GPU.
- benchmark_claude.ipynb: Used to classify the PolNLI test set with Claude Sonnet 3.5. Requires an Anthropic API key.
- benchmark_llama70B.ipynb: Used to classify the PolNLI test set with Llama 3.3 70B. Requires a Groq API key.
- appendix_e3.py: This script does not require a proprietary model, but requires archived versions of all of the checkpoints saved while training DEBATE. Because these files are over 100GB, they can't reasonably be stored in a replication archive. The authors will be happy to provide these checkpoints if replication of this script is required.